1 OBJECTIVES


Current social network analytic methods analyze a static aggregate graph, which provides a limited view of the
structure and behavior of real-world social networks. Real-world networks are dynamic: they evolve over time as
new connections form between individuals, and networks themselves act as a substrate for the flow of information and
influence. Ignoring dynamics can produce a distorted, or even wrong, view of who the important individuals are in
a social network, what the nature and strength of the connections between them are, and which communities of
similar or similarly behaving individuals exist. Erroneous conclusions reached by static network analysis waste
analysts’ time and resources.

For  these  reasons,  we  proposed  to  develop  network  analysis  methods  that  directly  incorporate  time.  The  research 
had  two  major  threads: 

•  Understand  how  networks  evolve  over  time,  and  how  changes  in  topology  affect  evolution  of  influence  and 
groups 

•  Understand  the  impact  of  dynamics  and  network  flows  on  the  measurement  of  network  structure 

Progress in dynamic network analysis was hampered by the scarcity of large-scale network data sets with fine-grained
temporal resolution. We mainly worked with two dynamic network data sets: citation data, representing citations
between physics articles over a period of 100 years, and online social network data we collected from social media
sites such as Digg and Twitter. While some of the models and observations are limited to these particular network data
sets, we believe that the methods and approaches we developed using these data sets will generalize to other dynamic
networks.

Below we describe our technical approach to these questions and our significant accomplishments.

2  DYNAMIC  NETWORK  ANALYSIS 

2.1  Measuring  Influence  in  Dynamic  Networks 

Centrality measures a node’s importance or influence in a network. Over the years a variety of measures have been
proposed for node centrality, including degree centrality, the Katz status score, alpha-centrality, eigenvector
centrality and betweenness centrality, and variants based on random walks, such as PageRank. Consider, specifically,
alpha-centrality as defined by Bonacich, which measures the total number of paths of any length between two nodes
i and j, with longer paths contributing less to the centrality than shorter paths. Let A be the adjacency matrix of a
network, such that A_ij = 1 if an edge exists from i to j and A_ij = 0 otherwise. The alpha-centrality matrix is given
by:

\[ C(\alpha) = A + \alpha A^2 + \alpha^2 A^3 + \cdots \]

where α is the attenuation factor along an edge. This parameter sets the length scale of interactions. For α = 0,
alpha-centrality takes into account direct edges only and reduces to degree centrality. As α increases, this becomes
a more global measure, taking into account more distant interactions. However, α is bounded by the inverse of the
largest eigenvalue of A. As α → 1/λ_max, alpha-centrality approaches eigenvector centrality. In numerous works,
we showed that alpha-centrality is a useful measure for identifying important nodes in a network. In a
dynamic network, where edges may change over time, the notion of a path must be refined to include time. To this
end, we defined the dynamic alpha-centrality matrix, which considers paths over time-dependent edges in a network.
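
As a minimal sketch of the static definition, assuming the series has the form C(α) = A + αA^2 + α^2 A^3 + ..., the matrix can be approximated by truncating the sum. The function names, the truncation length `max_len`, and the three-node example graph are illustrative assumptions, not part of the original work.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def alpha_centrality_matrix(A, alpha, max_len=50):
    """Truncated series A + alpha*A^2 + ... + alpha^(max_len-1) * A^max_len."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    term = [row[:] for row in A]  # holds alpha^(k-1) * A^k, starting at k = 1
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                C[i][j] += term[i][j]
        # Next term: one more edge on every path, attenuated by alpha.
        term = [[alpha * x for x in row] for row in mat_mul(term, A)]
    return C


# Hypothetical directed graph: 0 -> 1 -> 2 plus a shortcut 0 -> 2.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]

C_zero = alpha_centrality_matrix(A, alpha=0.0)  # only direct edges survive
C_half = alpha_centrality_matrix(A, alpha=0.5)  # the two-step path 0->1->2 adds 0.5
```

With α = 0 the result equals A, so row sums give out-degree; with α = 0.5 the entry for the pair (0, 2) grows from 1 to 1.5 because of the additional two-step path. The truncated sum only approximates the full series when α is below the inverse of the largest eigenvalue of A.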

In addition to dynamic alpha-centrality, we introduced two new measures of centrality for growing networks. The
first, the effective contagion matrix, overcomes the bias of centrality measures that fail to recognize important
new nodes that have not had as much time to accumulate links as their older counterparts. The second approach to
dynamic centrality extends the notion of time-dependent paths introduced in our earlier work to consider all
paths emanating from a node in a dynamic network, which together form a cascade. Comparing the size and structure
of the cascades generated by two nodes enables us to compare the nodes in importance.

Figure 1(a) illustrates our idea. A seed (red node) represents an established paper in a field of research. The paper’s
influence grows over time as new papers cite it and are later cited by other papers, creating a cascade of citations that
can be traced back to the seed. A challenger (blue node) is a paper that advocates a new paradigm. It attracts new
citations from papers shown as white nodes with blue background, leaving the complement cascade (green nodes)
containing papers in the cascade of the seed that are not connected to the challenger. When the challenger represents
a non-competing idea, there will be papers that cite both seed and challenger, but they will not interfere with
the growth of the seed’s cascade. In contrast, a transformative challenger will disrupt the growth of influence of the
established paper. Without considering the challenger, it may appear that the established paper continues to prosper,
as its cascade continues to grow, but subtracting the part of the cascade taken over by the challenger reveals that the
growth of the complement cascade (green nodes) slows. In this case, the community’s attention shifts to the challenger
paradigm.

Figure 1: (a) Information disruption by a challenger in an information cascade. The seed of an established paradigm,
marked in red, creates a cascade as it is cited by other papers, while a challenger, marked in blue, disrupts the cascade of
the seed. (b) Disruption of the cascade of the seed paradigm (red) by the challenger paradigm (blue) can be visualized
as the decline of Φ of the complement cascade (green). (c) An example cascade.

We briefly summarize the approach. For additional details, please see the publications listed in Section 2.1.2. Given
one or more papers S in a citation graph, a cascade C is a subgraph that contains all citation chains that end at S.
The set S is called the seed or root of the cascade. The seed indirectly exerts influence on all papers in the cascade,
but influence decays with the distance to the seed, just as in alpha-centrality. For a node j in the cascade, the
cascade generating function φ(j) summarizes the structure of the cascade, i.e., all existing citation chains. The
cascade generating function quantifies the influence of S on node j, and is defined recursively by


\[ \phi(j) := \begin{cases} 1 & \text{if } j \in S \\ \sum_{i \in \mathrm{cite}(j)} \alpha \, \phi(i) & \text{otherwise,} \end{cases} \qquad (1) \]


where α is a constant damping factor. Figure 1(c) shows an example cascade and the φ values for its nodes. For a
paper j published T time steps (e.g., years) after the publication of the seed, φ(j) can be written as follows:


\[ \phi(j) = \sum_{p=1}^{T} a_p \alpha^p \qquad (2) \]


where the coefficient a_p is the number of distinct paths of length p from one of the seeds to j. The smaller the
value of α, the higher the penalty against long paths. It is also possible to assign a distinct α to each
link, but we found it simpler to assign a constant α = 0.5 to all links to control its impact.
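
The recursion can be sketched in a few lines, assuming cite(j) denotes the papers that j cites and that the citation graph is acyclic (papers only cite earlier papers). The function name `cascade_phi`, the dictionary layout and the toy graph are hypothetical and serve only as illustration.

```python
from functools import lru_cache


def cascade_phi(cites, seeds, alpha=0.5):
    """Return phi(j) for every paper j; cites[j] lists the papers that j cites."""
    seeds = set(seeds)

    @lru_cache(maxsize=None)
    def phi(j):
        if j in seeds:
            return 1.0
        # Each cited paper passes on its own value, damped by alpha.
        return sum(alpha * phi(i) for i in cites.get(j, ()))

    return {j: phi(j) for j in cites}


# Hypothetical citation graph: "a" cites the seed "s", "b" cites both
# "s" and "a", "c" cites "b", and "x" is unrelated to the seed.
cites = {"s": [], "a": ["s"], "b": ["s", "a"], "c": ["b"], "x": []}
phi = cascade_phi(cites, seeds=["s"], alpha=0.5)
```

In this toy graph, paper "b" has one path of length 1 and one path of length 2 to the seed, so its value is 0.5 + 0.25 = 0.75, which matches the path-counting form of the function. Papers with no citation chain to the seed, such as "x", receive 0.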

We quantify this decline by the disruption score δ(τ), which is a function of the time interval τ given the seed
and challenger cascades. Let t_0 be the publication time of the challenger paper.


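\[ \delta(\tau) := \sum_{t=t_0}^{t_0+\tau} \log \frac{\Phi_t(C)}{\Phi_t(\bar{C})} = \sum_{t=t_0}^{t_0+\tau} \left( \log \Phi_t(C) - \log \Phi_t(\bar{C}) \right). \]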

Approved  for  Public  Release;  Distribution  Unlimited. 


Table 1: Top ten challengers to the 1957 “Theory of Superconductivity” identified by (a) proposed method and (b)
baseline.

Year  Cites  Title

(a) our method: sorted by disruption score
1958     14  Meissner Effect
1958    307  Random-Phase Approximation ... Superconductivity
1959     40  Evidence for Anisotropy of the Superconducting Energy...
1989    574  Phenomenology of ...Cu-O high-temperature supercon...
1987    368  Antiferromagnetism in La2CuO4-y
1987    281  Two-dimensional antiferromagnetic quantum ...
1988    149  Ba2YCu3O7: Electrodynamics of Crystals ...
1990    156  High-resolution angle-resolved photoemission ...
1988    399  Low-temperature behavior of two-dimensional quantum ...
1995     95  Momentum Dependence of the Superconducting ...

(b) baseline: sorted by citations
1981   3191  Self-interaction correction to density-functional approx...
1996   3088  Generalized Gradient Approximation Made Simple
1980   2651  Ground State of the Electron Gas by a Stochastic Method
1976   2569  Special points for Brillouin-zone integrations
1996   2387  Efficient iterative schemes for ab initio total-energy...
1990   1951  Soft self-consistent pseudopotentials in a generalized...
1991   1950  Efficient pseudopotentials for plane-wave calculations
1975   1597  Linear methods in band theory
1992   1567  Atoms, molecules, solids, and surfaces:...
1992   1445  Accurate and simple analytic representation...

The disruption score can be visualized as the area between the red and green curves in Figure 1(b) from t_0 to
t_0 + τ. The disruption score allows us to identify and measure the impact of the challenger paper.
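
The area interpretation can be sketched as follows. The sketch assumes the two curves are available as per-time-step values, one for the seed's full cascade and one for the complement cascade, and that the score sums the difference of their logarithms over the interval. The function name, argument layout and numbers are hypothetical, not measured data.

```python
import math


def disruption_score(seed_curve, complement_curve, t0, tau):
    """Sum of log(seed) - log(complement) over t = t0, ..., t0 + tau."""
    return sum(
        math.log(seed_curve[t]) - math.log(complement_curve[t])
        for t in range(t0, t0 + tau + 1)
    )


# Hypothetical curves: the seed's cascade keeps growing, while the
# complement cascade stalls once the challenger appears at t0 = 0.
seed_curve = {0: 2.0, 1: 4.0, 2: 8.0}
complement_curve = {0: 2.0, 1: 2.0, 2: 2.0}
score = disruption_score(seed_curve, complement_curve, t0=0, tau=2)
```

A larger score means a wider gap between the two curves, that is, a larger share of the seed's cascade taken over by the challenger. When the two curves coincide, the score is zero.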

We applied the disruption score to identify transformative physics articles published by the American Physical
Society (APS). Through several case studies we showed that the proposed method is better able to identify successful
challengers than an alternative baseline that considers the number of citations received by the paper. Further, we
demonstrated that our method identifies more relevant challengers than the baseline. Moreover, a challenger’s success
is evident early on, allowing for early detection of transformative research.

2.1.1  Case  Study. 

In 1957 Bardeen, Cooper and Schrieffer published a seminal paper titled “Theory of Superconductivity” which
explained the mechanism by which some metals become perfect electrical conductors (i.e., they lose their electrical
resistance) at low temperatures. The authors were awarded a Nobel prize for this discovery in 1972. This paper is one
of the ten most cited papers in the APS dataset.

Table 1 lists the ten top-ranked challengers identified by our method and the baseline (number of citations).
Compared to the baseline, our method identifies papers that are relevant to the topic of superconductivity. All ten of
the top challengers identified by the baseline are papers dealing with calculations of the electronic structure of
materials, and include other most-cited papers in the APS dataset. While this is a very important topic, it is only
peripherally related to superconductivity, inasmuch as this phenomenon is a result of correlated electron pairs.

The proposed method discovered papers on high temperature superconductivity (HTS). The discovery of HTS
was an important development in the study of superconductivity, recognized with a Nobel prize in 1987. Although
the original paper announcing the discovery is not in our dataset, the presence of several other papers on HTS among
the top challengers demonstrates the efficacy of our method in identifying disruptive papers. These challengers include
“Antiferromagnetism in La2CuO4-y”, “Two-dimensional antiferromagnetic quantum spin-fluid state in La2CuO4”,
“Ba2YCu3O7: Electrodynamics of Crystals with High Reflectivity” and “Momentum Dependence of the
Superconducting Gap in Bi2Sr2CaCu2O8”.

Figure 2: State diagram of a user v’s behavior with respect to another user u who appears on v’s screen. The ovals
represent parameters which are measured, while circles represent parameters to be estimated from data. To simplify the
model, only the filled circles, which represent the hyper-parameters, are ultimately estimated from data.


2.1.2  Relevant  Publications. 

• Y.-H. Huang, C.-N. Hsu, and K. Lerman. Identifying transformative scientific research. In Proc. of IEEE
International Conference on Data Mining, 2013.

• R. Ghosh and K. Lerman. A framework for quantitative analysis of cascades on networks. In Proceedings of
Web Search and Data Mining Conference (WSDM), February 2011.

• R. Ghosh and K. Lerman. Parameterized centrality metric for network analysis. Physical Review E, 83(6):066118,
June 2011.

• R. Ghosh, T.-T. Kuo, C.-N. Hsu, S.-D. Lin, and K. Lerman. Time-aware ranking in dynamic citation networks.
In COMMPER 2011: Mining Communities and People Recommendations, Data Mining Workshops (ICDMW),
2010 IEEE International Conference on, pages 373-380, December 2011.

• R. Ghosh and K. Lerman. Predicting influential users in online social networks. In Proceedings of KDD
workshop on Social Network Analysis (SNA-KDD), July 2010.

• K. Lerman, R. Ghosh, and J.-h. Kang. Centrality metric for dynamic network analysis. In Proceedings of KDD
workshop on Mining and Learning with Graphs (MLG), July 2010.

2.2  Link  Prediction  in  Dynamic  Networks 

A core task of social network analysis is to predict the formation of new links between nodes. In the context of social
media, link prediction serves as the foundation for forecasting the evolution of the follower graph and predicting
interactions and the flow of information between users. Previous link prediction methods have generally represented
the social network as a graph and leveraged topological and semantic measures of similarity between two nodes to
evaluate the probability of link formation. We proposed a link creation mechanism for social media wherein a
person v creates a link to person u after seeing u’s name on his or her screen. In other words, visibility of a user’s
name is a necessary condition for new link formation. This work illustrates our approach of uncovering mechanisms
driving behavior in social networks and building models based on these mechanisms.

We proposed a visibility-based model for link prediction, which estimates the probability that a user views another
user’s name, and used this model to predict the evolution of the social media follower graph. Figure 2 shows the process
through which a user v from a population views another user u, notices u and then responds to u by deciding to follow
u. When a user v visits a social media site to view her stream, she may view a limited portion of her stream, hence
seeing only a limited number of screen names. The likelihood that v will see u’s name in her stream depends on the
total number of screen names in v’s stream, how frequently v visits the social media site, and the average
number of names v views per visit. Once v notices u’s name, which happens with a certain probability P_u(v), v might
create a new “follow” link.

We created a stochastic model that describes the process of discovering users shown in Figure 2 and estimated its
parameters by a maximum-likelihood approach. We applied the model to study the dynamics of the Twitter follower
graph by predicting new follower links. We used two popular metrics to evaluate the link prediction results: the Area
Under the ROC Curve (AUC) and mean average precision (MAP). The ROC curve is created by plotting the fraction of
true positives out of the total actual positives (true positive rate) vs. the fraction of false positives out of the total
actual negatives (false positive rate), at various threshold settings. Thus, the AUC score evaluates the overall ranking
yielded by the algorithms, with a larger AUC indicating better link prediction performance.
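
For readers unfamiliar with the two metrics, the following is a minimal sketch of how they can be computed from predicted link scores. It uses the standard pairwise formulation of AUC (the fraction of positive/negative pairs ranked in the right order, with ties counted as one half) and the usual definition of average precision; MAP is then the mean of average precision over users. The function names and toy scores are illustrative assumptions, not the evaluation code used in this project.

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of precision@k taken at the rank k of every true link."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(per_user):
    """per_user: list of (scores, labels) pairs, one per user."""
    return sum(average_precision(s, y) for s, y in per_user) / len(per_user)


# Hypothetical candidate links for one user: label 1 means the link formed.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
auc_value = auc(scores, labels)               # 3 of 4 pairs ordered correctly
ap_value = average_precision(scores, labels)  # (1/1 + 2/3) / 2
```

AUC rewards a good ordering over the whole candidate list, whereas average precision weights the top of the ranking more heavily, which is why the two metrics are usually reported together.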


Figure  3:  Comparison  of  different  algorithms  on  the  task  of  predicting  follow  links  using  MAP  and  AUC  scores. 

We compared the performance of the proposed model to six baselines: linear regression (LR),
edge rank (ER), preferential attachment (PA), Adamic-Adar (AA) score, latest activity (LL), and exposure frequency
(EF). Figure 3 compares the proposed model with the baselines on the task of predicting new follow links. Note that
the x-axis denotes the testing time interval, and all data available before the testing time are used for training. The
proposed approach SV (solid red) outperforms all baselines in terms of both MAP and AUC scores.

2.2.1  Relevant  Publications. 

•  L.  Zhu  and  K.  Lerman.  A  visibility-based  model  for  link  prediction  in  social  media.  In  Proceedings  of  the 
ASE/IEEE  Conference  on  Social  Computing,  2014. 

•  N.  O.  Hodas  and  K.  Lerman.  How  limited  visibility  and  divided  attention  constrain  social  contagion.  In 
ASE/IEEE  International  Conference  on  Social  Computing,  2012. 


3  DYNAMIC  PROCESSES  ON  NETWORKS 

The  major  focus  of  this  project  was  to  understand  how  dynamics  affected  our  understanding  and  measurement  of 
network  structure.  By  dynamics  we  mean  the  processes  by  which  nodes  in  a  network  interact,  e.g.,  to  share  money, 
ideas,  information,  or  influence  each  other.  For  example,  consider  a  social  network  in  which  people  exchange  money 
for  goods  and  services.  Each  person  takes  some  money  and  distributes  it  among  neighbors.  Since  no  money  is  created 
or  destroyed,  this  type  of  interaction  can  be  modeled  by  a  random  walk.  A  random  walk  is  a  process  where  at  each  step 
a  walker  chooses  a  random  neighbor  of  a  given  node  to  transition  to.  While  many  social  processes  can  be  described 
by  random  walks  (web  surfing,  used  goods  exchange,  phone  calls,  etc.)  many  cannot  be  thus  described.  Consider  an 
infectious  disease  spreading  between  people.  At  each  point,  rather  than  infecting  only  one  neighbor,  a  sick  person  can 
infect  (with  some  probability)  all  neighbors.  Information,  influence  and  epidemics  spread  in  such  non-conservative 
fashion. 
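
The distinction can be made concrete with a small sketch of one update step of each kind. In the conservative step a node divides what it holds among its neighbors, so the total is preserved; in the non-conservative step every neighbor receives its own copy, scaled by a transmission probability. The function names, the dictionary-based graph and the parameter `alpha` are assumptions made for illustration.

```python
def conservative_step(adj, mass):
    """Random-walk step: each node splits its mass evenly among its neighbors."""
    new = {v: 0.0 for v in adj}
    for v, nbrs in adj.items():
        for u in nbrs:
            new[u] += mass[v] / len(nbrs)
    return new


def non_conservative_step(adj, mass, alpha):
    """Epidemic-like step: each neighbor receives a full copy, scaled by alpha."""
    new = {v: 0.0 for v in adj}
    for v, nbrs in adj.items():
        for u in nbrs:
            new[u] += alpha * mass[v]
    return new


# Hypothetical star graph: hub "h" connected to three leaves.
adj = {"h": ["a", "b", "c"], "a": ["h"], "b": ["h"], "c": ["h"]}
start = {"h": 1.0, "a": 0.0, "b": 0.0, "c": 0.0}

walk = conservative_step(adj, start)                      # each leaf gets 1/3
spread = non_conservative_step(adj, start, alpha=0.5)     # each leaf gets 0.5
```

After one step the random walk still carries a total of 1, while the non-conservative process carries 1.5: the quantity being spread is not conserved. Note that in this sketch a node without neighbors would simply drop its mass in the conservative step; the example graph avoids that case.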

Taking dynamics into account changes our perception of network structure. We use a simple example to illustrate
how dynamics affects who the central nodes are and what communities exist in a network. A central node in a money
exchange network is one who frequently receives sums of money from others. If this were an illicit exchange network,
law enforcement would want to target this central node to disrupt the network’s activities. A central node in an
infectious disease network, on the other hand, is one who is frequently infected by others. If we had just one vaccine
to give, we would give it to this central node to best stop the spread of the disease. If the same network has both
a virus spreading on it and illicit money exchange taking place, then the node we will want to vaccinate will almost
certainly not be the same as the one that would be targeted by law enforcement. Similarly, if we define a community
in a money exchange network as a group of nodes who frequently exchange money with each other, it almost certainly
will not be the same group that frequently infects each other with a virus. Below we clarify and formalize the impact of
dynamic processes on network structure.

3.1  Dynamics  and  Centrality 

Existing centrality measures examine the link structure of the network to identify key individuals within it. However,
as we argued previously, centrality is intimately related to dynamic processes taking place on the network, processes
which determine how information, disease or goods flow on the network. For example, by modeling Web surfing
as a random walk, PageRank assigns a score to each Web page based on its value in the equilibrium distribution of
the random walk. In contrast, a central individual in a social network through which a disease is spreading is one who
infects most others. Such influential individuals are scored highly by the Katz index or Bonacich’s Alpha-Centrality,
both of which give the equilibrium distribution of an epidemic process on a network.

Now consider information spreading through a population, for instance, by users sending messages or product
recommendations to their friends using email or social media. We have demonstrated recently that information spread
cannot be modeled as an epidemic diffusion. Instead, cognitive constraints, such as limited attention, are important.
Attention is the psychological mechanism that controls how we process incoming stimuli and decide what activities to
engage in. Actions, such as reading a tweet, browsing a Web page, or reading a science article, require mental effort,
and since the human brain’s capacity for mental effort is limited, so is attention. As a consequence, the more stimuli
people have to process, the smaller the probability they will respond to any one stimulus.

Cognitive constraints change the nature of social interactions and, therefore, how central nodes are identified. Now a
node’s capacity to infect others depends not only on how many connections it has but also on who and how many others
these nodes are connected to. We have recently introduced a new centrality for social networks, limited-attention
Alpha-Centrality (laAC), which models the attention-limited nature of social interactions, and provided its mathematical
definition. We also developed a fast approximate algorithm to calculate this measure on large graphs and provided its
performance guarantees.

Alpha-Centrality measures the total number of paths from a node, exponentially attenuated by their length. Bonacich
introduced this measure as a generalization of the index of status proposed by Katz, and it is sometimes referred to
as Bonacich centrality. The Alpha-Centrality vector cr(α, s) can be defined iteratively in terms of the adjacency
matrix A of the graph:

\[ cr(\alpha, s) = s + \alpha A \cdot cr(\alpha, s), \qquad (3) \]

where the starting vector s = Ae, with e a vector of ones, is taken as out-degree centrality.

Alpha-Centrality gives the steady state distribution of an epidemic process on a network, where α is the
probability to transmit a message or influence along a link. Therefore, the ith entry of cr can be interpreted as the
number of infections directly or indirectly caused by node i (see attached paper for more details).
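
The iterative definition suggests a simple fixed-point computation, sketched below under the assumption that the starting vector is the out-degree of each node and that α is small enough for the iteration to converge (below the inverse of the largest eigenvalue of A). This plain iteration is only an illustration and is not the fast approximate algorithm developed in this project; the function name and example graph are hypothetical.

```python
def alpha_centrality(A, alpha, iters=200):
    """Iterate cr <- s + alpha * A * cr, starting from s = out-degree."""
    n = len(A)
    s = [sum(row) for row in A]  # s = A e: out-degree of each node
    cr = s[:]
    for _ in range(iters):
        cr = [s[i] + alpha * sum(A[i][j] * cr[j] for j in range(n))
              for i in range(n)]
    return cr


# Hypothetical directed graph: 0 -> 1 -> 2 plus a shortcut 0 -> 2.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
cr = alpha_centrality(A, alpha=0.5)
```

Node 0 reaches two nodes directly and node 2 once more through node 1, so its score is 2 + 0.5 = 2.5; node 1 scores 1 and node 2, which has no outgoing edges, scores 0.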


Figure 4: Directed network with sizes of nodes weighted by their score according to (a) Alpha-Centrality and (b)
limited-attention Alpha-Centrality of the influence graph.

Let us now consider the case in which a node’s capacity to receive incoming stimuli, whether messages or
viruses, is limited. Attention need not be distributed uniformly over friends: some friends may receive a
greater share of a person’s attention due to familiarity, trust, social closeness, or influence. For simplicity, we assume
that each friend receives the same fraction of a person’s attention. Therefore, the probability that node j will receive a
message broadcast by i will be proportional to 1/d_in(j), where d_in(j) is the in-degree of node j. The limited-attention
Alpha-Centrality matrix can be written in terms of the modified adjacency matrix M = AD^{-1}, where D is the diagonal
matrix of in-degrees, as:

\[ C_{la} = M + \alpha M^2 + \alpha^2 M^3 + \cdots \]

The limited-attention Alpha-Centrality vector cr_la(α, s) can also be written in iterative form:

\[ cr_{la}(\alpha, s) = s + \alpha A D^{-1} \cdot cr_{la}(\alpha, s), \qquad (4) \]

with the starting vector s = AD^{-1}e. Figure 4 illustrates the differences between Alpha-Centrality and its
limited-attention variant.

Note  that  we  have  developed  fast  approximate  algorithms  to  compute  Alpha-Centrality  and  limited-attention 
Alpha-Centrality. 
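
A minimal sketch of the limited-attention variant follows, assuming each edge i -> j is down-weighted by the in-degree of the receiving node j, so that a message from i competes with everything else j receives. As with the sketch of plain Alpha-Centrality, this is a straightforward fixed-point iteration for illustration, not the fast approximate algorithm mentioned above; the function name and example graph are hypothetical.

```python
def limited_attention_alpha_centrality(A, alpha, iters=200):
    """Iterate cr <- s + alpha * M * cr, with M[i][j] = A[i][j] / d_in(j)."""
    n = len(A)
    d_in = [sum(A[i][j] for i in range(n)) for j in range(n)]
    # Columns of nodes with no incoming edges are all zero, so no division is needed.
    M = [[A[i][j] / d_in[j] if d_in[j] else 0.0 for j in range(n)]
         for i in range(n)]
    s = [sum(row) for row in M]  # attention-weighted out-degree
    cr = s[:]
    for _ in range(iters):
        cr = [s[i] + alpha * sum(M[i][j] * cr[j] for j in range(n))
              for i in range(n)]
    return cr


# Hypothetical directed graph: 0 -> 1 -> 2 plus a shortcut 0 -> 2.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
cr_la = limited_attention_alpha_centrality(A, alpha=0.5)
```

Node 2 has two incoming edges, so each of them carries only half of its attention. Node 0 therefore scores 1.75 instead of the 2.5 it would receive without the attention limit, and node 1 scores 0.5 instead of 1.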


Figure 5: Correlation of rankings of (a) Digg and (b) Twitter users found by different measures of centrality with the
empirical influence ranking.

We evaluated the performance of the centrality measures on the task of identifying influential users of large
networks from social media sites containing hundreds of thousands of users. However, in order to evaluate the
performance of centrality, we need a relevant measure of influence. User activity provides us with an empirical measure
of influence. When a user posts a URL on Digg or Twitter, she broadcasts it to all her followers. We refer to this user as
the submitter. Whether or not her follower will re-broadcast the URL (i.e., retweet it on Twitter or vote for it on Digg)
depends on its quality and the submitter’s influence. Assuming that a URL’s quality is uncorrelated with the submitter,
we can average out its effect by aggregating over all URLs submitted by the same user. The residual difference in the
amount of attention the followers pay the submitter by re-broadcasting her messages can be attributed to variations in
the submitter’s influence. Therefore, we use the average number of times the URLs submitted by the user are
re-broadcast by her followers as the empirical measure of influence.
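
The empirical measure amounts to a per-user average, as in the sketch below. The input format, a list of (submitter, number of follower re-broadcasts) pairs with one entry per submitted URL, and the sample values are hypothetical.

```python
from collections import defaultdict


def empirical_influence(submissions):
    """Average number of follower re-broadcasts per URL, for each submitter."""
    totals = defaultdict(lambda: [0, 0])  # user -> [re-broadcasts, URL count]
    for user, rebroadcasts in submissions:
        totals[user][0] += rebroadcasts
        totals[user][1] += 1
    return {user: total / count for user, (total, count) in totals.items()}


# Hypothetical data: one (submitter, follower re-broadcasts) pair per URL.
submissions = [("u1", 10), ("u1", 30), ("u2", 4)]
influence = empirical_influence(submissions)
```

Averaging over all URLs of the same submitter is what removes the effect of individual URL quality, under the stated assumption that quality is uncorrelated with the submitter.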

Figure 5 shows how well the rankings produced by different centralities correlate with the empirical influence
rankings of users who submitted at least two URLs which were rebroadcast at least ten times. We use Spearman rank
correlation because it is less sensitive to variations in scores, and we expect some variation to arise in approximate
centrality scores. Limited-attention Alpha-Centrality correlates better with the empirical measure of influence than
Alpha-Centrality over a broad range of α values, consistent with our claim that laAC is a better measure for predicting
central social media users, because it better models the dynamics of online communication than AC. On Digg, AC
appears to outperform laAC for small values of α. Since α can be thought of as the scale of interaction, this implies
that locally, AC better predicts influential users. This could be a consequence of the fact that our measure of
influence, i.e., the number of re-broadcasts by followers, is a local measure. In the future, we plan to compare the
performance of centrality measures using a global measure of influence, for example, the average size of cascades
triggered by submitted URLs. We did not expect limited-attention PageRank (laPR) (described in the accompanying
paper) to predict influence rankings of Digg and Twitter users, since the dynamic process this centrality models does
not at all describe communication patterns of social media users, and we found no correlation.
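
For reference, Spearman rank correlation is the Pearson correlation of the ranks of the two score lists, which is why it depends only on the ordering of users and not on the magnitude of their scores. The sketch below implements it with average ranks for ties; the helper names and sample values are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward, averaging the ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical centrality scores and empirical influence for four users.
centrality = [2.5, 1.0, 0.0, 0.7]
influence = [30, 12, 1, 15]
rho = spearman(centrality, influence)
```

In this example the two rankings disagree only on the order of the second and fourth users, which gives a correlation of 0.8; identical orderings give 1.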

3.1.1 Relevant Publications.

• R. Ghosh and K. Lerman. Rethinking centrality: The role of dynamical processes in social network analysis.
Discrete and Continuous Dynamical Systems Series B, 19(5):1355-1372, July 2014.

• K. Lerman, P. Jain, R. Ghosh, J. Kang, and P. Kumaraguru. Limited attention and centrality in social
networks. In Proceedings of International Conference on Social Intelligence and Technology (SOCIETY), 2013.

• R. Ghosh and K. Lerman. Parameterized centrality metric for network analysis. Physical Review E, 83(6):066118,
June 2011.

3.2  Dynamics  and  Communities 

Dynamic processes also affect the emergent communities. We illustrate this with a simple example of opinion formation.
Imagine a network of interacting agents, each holding an opinion. Each interaction causes agents’ opinions to become
more similar. As the network evolves over time, opinions of agents within the same community converge faster than
those of other agents. This framework allows us to study how network topology and interactions, which mediate
the transfer of opinions between agents, both affect the formation of communities. In traditional models of opinion
dynamics, agents interact via one-to-one opinion transfer. Such conservative interactions can be modeled as random
walks. However, social interactions are often non-conservative, resulting in one-to-many transfer of opinions. These
interactions result in different emergent patterns of consensus, that is, communities of agents holding the same opinion.

We simulated the dynamics of opinion formation in the real-world networks of the Digg follower graph and the Facebook
friendship graph. We simulated two different interaction models: one-to-one and one-to-many interactions. Both
types of models revealed a similar “core and whiskers” community structure in each network, with a giant core and
small communities, or whiskers, loosely attached to the core. Furthermore, this structure was multi-scale. Isolating
the core and simulating the dynamics of opinion formation within it revealed finer-grained structure, with another core
and small “whiskers” attached to it. However, the composition of the giant cores identified by the two interaction
models was very different.


[Figure 6 panels (a) and (b); horizontal axis: resolution scale.]

Figure 6: Properties of small communities found in the Digg mutual follower graph by the two interaction models. (a)
Number of small communities at different resolution scales. (b) Average number of co-votes made by community
members.

Figure 6(a) shows the number of small communities (whiskers) resolved by the two interaction models at different
resolution scales, measured by closeness to the center of the core. The one-to-many (non-conservative) interaction model
assigned many more users to such communities than the one-to-one (conservative) interaction model. The rest of the
users fragmented into isolated pairs or singletons, who did not synchronize their opinions with any others. How does
the quality of the discovered communities differ? We measure the similarity of two Digg users by the number of stories
for which they both voted, i.e., co-votes. Then, averaging the co-votes over all connected pairs of community members,
we obtain a measure of community “cohesiveness.” As seen in Figure 6(b), the average number of co-votes increases at
finer resolution scales, producing more cohesive communities in the center of the core.

Figure 7: Properties of small communities found in the Facebook network of American University by the two
interaction models. Each plot shows, at different resolution scales, the probability of occurrence of the most frequent value
of user features: (a) major, (b) dorm, (c) year, (d) category of individual. Conservative and non-conservative refer to
one-to-one and one-to-many interactions, respectively.


How do the small communities discovered at different resolution scales by the two interaction models in the
Facebook social network differ? We look at four features of users in the data set (major, dorm, year and category of
individual) and calculate the prevalence of feature values among community members. The community is
characterized by the prevalence of the most popular feature value among its members, or its cohesiveness with respect to that
feature. For example, when using the dorm feature to characterize the community, dorm cohesiveness is the largest
fraction of community members that belong to the same dorm. Figure 7 shows that the cohesiveness of communities
found by the two interaction models, with respect to a given feature, generally increases at
finer resolution scales. This suggests that individuals in the center of the core are far more similar to each other than
peripherally connected individuals. More importantly, the characteristics of the community structure discovered by
conservative and non-conservative interaction models vary significantly.
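Both cohesiveness measures used in this section reduce to short computations. The sketch below is illustrative only: the function names, the `votes` mapping from users to the stories they voted for, and the toy data are assumptions made for the example, not part of the project code.

```python
from collections import Counter
from itertools import combinations


def covote_cohesiveness(members, links, votes):
    """Average number of co-votes over connected pairs of community members.

    members: iterable of user ids in one community
    links:   set of frozenset({u, v}) undirected (mutual follower) links
    votes:   dict mapping user id -> set of story ids the user voted for
    """
    counts = [
        len(votes[u] & votes[v])
        for u, v in combinations(sorted(members), 2)
        if frozenset((u, v)) in links
    ]
    # With no connected pair the average is undefined; report 0.0 by convention.
    return sum(counts) / len(counts) if counts else 0.0


def feature_cohesiveness(values):
    """Largest fraction of community members sharing one value of a feature."""
    values = list(values)
    if not values:
        return 0.0
    top_count = Counter(values).most_common(1)[0][1]
    return top_count / len(values)


# Hypothetical toy community of three Digg users with two links.
votes = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {3, 9}}
links = {frozenset(("a", "b")), frozenset(("b", "c"))}
digg_score = covote_cohesiveness(["a", "b", "c"], links, votes)

# Hypothetical dorm assignments for a four-member Facebook community.
dorm_score = feature_cohesiveness(["north", "north", "south", "north"])
```

In the toy data, the pair a-b shares two stories and the pair b-c shares one, while a and c are not linked and so do not contribute to the average.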

3.2.1  Relevant  Publications. 

• R. Ghosh and K. Lerman. The Impact of Network Flows on Community Formation in Models of Opinion
Dynamics, to appear in J. of Mathematical Sociology.

• L. M. Smith, K. Lerman, C. Garcia-Cardona, A. G. Percus, and R. Ghosh. Spectral clustering with epidemic
diffusion. Physical Review E, 88(4):042813, Oct. 2013.

• K. Lerman and R. Ghosh. Network Structure, Topology and Dynamics in Generalized Models of
Synchronization. Physical Review E, 86(2):026108, 2012.

3.3 Dynamics and Link Prediction

As described above, predicting new links in networks is a critical capability, both for the military and for commercial
applications, such as product and social recommendation. Link prediction heuristics take a network’s current structure
into account to predict new links that will form between existing nodes. A variety of link prediction heuristics have
been proposed, including neighborhood overlap; the Adamic-Adar score, which weighs the contribution of each
common neighbor by the inverse of the logarithm of its degree; and the number of paths connecting the two nodes. In general,
link prediction heuristics consider how close two nodes are in a network. In a money exchange network, for
example, two individuals can be considered close, even if they don’t know each other, if the money distributed by one
is often received by the other. In other words, the probability that a random walk originating at one node reaches the
other measures the proximity of the two nodes in the network.

We argue that proximity should take into account the ability of two nodes to exchange information with, or to
influence, each other. This ability is determined by the dynamic processes taking place on the network, i.e., the processes
by which information or influence is transmitted from one node to another.

In this project, we unified link prediction heuristics by viewing them as instances of network proximity under
different dynamics, and introduced new ones based on other dynamic processes. The heuristics we examined are listed
in Table 2. Note that these are defined for directed networks, where $\Gamma_{in}(v)$ refers to the in-neighbors of node $v$, i.e.,
the $d_{in}(v)$ nodes whose edges are incident on $v$, where $d_{in}(v)$ is the in-degree of $v$, and similarly for the out-neighbors.
In addition to existing ones, we introduced new heuristics based on a node’s limited bandwidth. Consider a process
in which a node’s capacity to receive incoming messages is limited by its bandwidth. As a consequence, the more
incoming connections (in-links) a node has, the less likely it is to receive a message from an arbitrary connection,
e.g., because it has already reached the limit of its capacity by processing other incoming messages. This alters the
character of the flow and leads to novel measures of network proximity, which we call limited-bandwidth or
limited-attention epidemics or random walks.

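Table 2: Heuristics used in link prediction applications. Popular existing link prediction heuristics appear above the
double line: number of common neighbors, Jaccard and Adamic-Adar scores, and resource allocation. Below the
double line are link prediction heuristics introduced in this paper. In the definitions, $\Lambda$ and $\Lambda'$ are read as the sets
of intermediate nodes on two-step paths from $u$ to $v$ and from $v$ to $u$, i.e., the intersections in the Jaccard numerators.

name                                       symbol  definition
-----------------------------------------  ------  ----------
common neighbors                           CN      $\frac{1}{2}\left[|\Lambda| + |\Lambda'|\right]$
Jaccard score                              JC      $\frac{1}{2}\left[\frac{|\Gamma_{out}(u) \cap \Gamma_{in}(v)|}{|\Gamma_{out}(u) \cup \Gamma_{in}(v)|} + \frac{|\Gamma_{out}(v) \cap \Gamma_{in}(u)|}{|\Gamma_{out}(v) \cup \Gamma_{in}(u)|}\right]$
Adamic-Adar                                AA      $\frac{1}{2}\left[\sum_{z \in \Lambda} \frac{1}{\log(d(z))} + \sum_{z' \in \Lambda'} \frac{1}{\log(d(z'))}\right]$
resource allocation                        RA      $\frac{1}{2}\left[\sum_{z \in \Lambda} \frac{1}{d(z)} + \sum_{z' \in \Lambda'} \frac{1}{d(z')}\right]$
=========================================  ======  ==========
conservative (random walk)                 CS      $\frac{1}{2}\sum_{z \in \Lambda} \frac{1}{d_{out}(u)\, d_{out}(z)} + \frac{1}{2}\sum_{z' \in \Lambda'} \frac{1}{d_{out}(v)\, d_{out}(z')}$
limited-bandwidth conservative             lCS     $\frac{1}{2}\sum_{z \in \Lambda} \frac{1}{d_{out}(u)\, d_{in}(z)\, d_{out}(z)\, d_{in}(v)} + \frac{1}{2}\sum_{z' \in \Lambda'} \frac{1}{d_{out}(v)\, d_{in}(z')\, d_{out}(z')\, d_{in}(u)}$
non-conservative (epidemic)                NC      $\frac{1}{2}\left[|\Lambda| + |\Lambda'|\right]$
limited-bandwidth non-conservative         lNC     $\frac{1}{2}\sum_{z \in \Lambda} \frac{1}{d_{in}(z)\, d_{in}(v)} + \frac{1}{2}\sum_{z' \in \Lambda'} \frac{1}{d_{in}(z')\, d_{in}(u)}$
hybrid conservative                        hCS     $\frac{1}{2}\left[\sum_{z \in \Lambda} \frac{1}{d_{out}(z)} + \sum_{z' \in \Lambda'} \frac{1}{d_{out}(z')}\right]$
hybrid limited-bandwidth conservative      hlCS    (definition not legible)
hybrid non-conservative                    hNC     (definition not legible)
hybrid limited-bandwidth non-conservative  hlNC    (definition not legible)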
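To make the directed definitions concrete, the sketch below scores one candidate pair with four of the heuristics (CN, CS, lNC and hCS). It is a minimal illustration under stated assumptions, not the delivered link prediction code: the function names are hypothetical, and the sets `lam` and `lam_p` are taken to be the two-step intermediaries $\Gamma_{out}(u) \cap \Gamma_{in}(v)$ and $\Gamma_{out}(v) \cap \Gamma_{in}(u)$, which is our reading of Table 2.

```python
def neighbor_sets(edges):
    """Return (out_nb, in_nb) dicts for a list of directed edges (u, v)."""
    out_nb, in_nb = {}, {}
    for u, v in edges:
        for n in (u, v):
            out_nb.setdefault(n, set())
            in_nb.setdefault(n, set())
        out_nb[u].add(v)
        in_nb[v].add(u)
    return out_nb, in_nb


def proximity_scores(u, v, out_nb, in_nb):
    """Score the candidate pair (u, v) with a few of the Table 2 heuristics."""
    lam = out_nb[u] & in_nb[v]      # z on a two-step path u -> z -> v
    lam_p = out_nb[v] & in_nb[u]    # z' on a two-step path v -> z' -> u

    def d_out(n):
        return len(out_nb[n])

    def d_in(n):
        return len(in_nb[n])

    # Every degree below is at least 1 because z (or z') lies on a path
    # between u and v, so no division by zero can occur.
    return {
        "CN": 0.5 * (len(lam) + len(lam_p)),
        "CS": 0.5 * sum(1 / (d_out(u) * d_out(z)) for z in lam)
        + 0.5 * sum(1 / (d_out(v) * d_out(z)) for z in lam_p),
        "lNC": 0.5 * sum(1 / (d_in(z) * d_in(v)) for z in lam)
        + 0.5 * sum(1 / (d_in(z) * d_in(u)) for z in lam_p),
        "hCS": 0.5 * (sum(1 / d_out(z) for z in lam)
                      + sum(1 / d_out(z) for z in lam_p)),
    }


# Hypothetical undirected toy graph, encoded with one directed edge each way.
undirected = [("a", "c"), ("b", "c"), ("a", "d"), ("b", "d"), ("c", "e")]
directed = undirected + [(v, u) for u, v in undirected]
out_nb, in_nb = neighbor_sets(directed)
scores = proximity_scores("a", "b", out_nb, in_nb)
```

Because the toy graph is undirected, in-degrees equal out-degrees, so the CS and lNC scores coincide here.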



Table 3: Networks studied in the missing link prediction task and their properties.

network                   nodes   edges  missing  density
------------------------  -----  ------  -------  -------
social networks
  dolphins                   62     159       16  0.084
  email                    1133    5452      545  0.0085
  jazz                      198    2742      274  0.14
  connect                  1095    7825      783  0.014
  hep-th                   8710   14254     1425  0.0003
  netscience               1461    2742      274  0.0013
  imdb                     6260   98235     9824  0.005
technological networks
  us air                    332    2126      212  0.0193
  power grid               4941    6594      660  0.0004
biological networks
  protein                  1870    2277      228  0.0013
  c. elegans                453    2040      204  0.02

We conducted experiments on a variety of networks belonging to three categories: social, technological and
biological networks. Table 3 lists some of the statistics of the datasets. We evaluated the performance of link
prediction heuristics on the missing link prediction task in these networks. Since all the networks studied here are
undirected, some of the heuristics are mathematically equivalent: CS = lNC, RA = hCS = hlNC, and CN =
NC = hNC.

We ran several trials of the link prediction task for each network. In each trial, we first randomly remove 10%
of all edges and assign them to the test set $E_{test}$. The remaining 90% of links comprise the training set, giving the graph
$G_{train} = (V, E_{train})$. We then compute network proximity using a given link prediction heuristic for all
$|V \times V - E_{train}|$ pairs of nodes and rank them in decreasing order. We score the prediction based on how many of the top-M
predicted edges are correct. This allows us to compute a curve showing precision@M, the ratio of the number of
correctly predicted links among the M highest-scoring edges to M.
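The evaluation protocol described above can be sketched in a few lines. This is a hedged illustration rather than the code used in the experiments: the function names and the toy graph are hypothetical, and the common-neighbors count stands in for whichever heuristic is being evaluated.

```python
import random
from itertools import combinations


def split_edges(edges, test_fraction=0.1, seed=0):
    """Randomly hold out a fraction of edges; returns (train, test) sets."""
    shuffled = sorted(edges)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(test_fraction * len(shuffled)))
    return set(shuffled[n_test:]), set(shuffled[:n_test])


def common_neighbors(u, v, nbrs):
    """Stand-in proximity score: shared neighbors in the training graph."""
    return len(nbrs[u] & nbrs[v])


def precision_at_m(nodes, train, test, score, m):
    """Fraction of the m highest-scoring non-training pairs that are in test."""
    nodes = sorted(nodes)
    nbrs = {n: set() for n in nodes}
    for a, b in train:
        nbrs[a].add(b)
        nbrs[b].add(a)
    # Candidate links are all node pairs that are not already training edges.
    candidates = [p for p in combinations(nodes, 2) if p not in train]
    ranked = sorted(candidates, key=lambda p: score(p[0], p[1], nbrs), reverse=True)
    return sum(1 for p in ranked[:m] if p in test) / m


# Hypothetical toy graph; each edge is stored as (smaller id, larger id).
edges = {(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)}
train, test = split_edges(edges)
curve = [precision_at_m(range(5), train, test, common_neighbors, m) for m in (1, 2, 3)]
```

Repeating the split with different seeds and averaging the resulting curves would correspond to the several trials mentioned in the text.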


[Figure 8 panels, each with horizontal axis “link prediction heuristic”: social networks (dolphins, imdb, hep-th,
netscience, email, jazz, connect); biological networks (protein, c. elegans); technological networks (power grid, us air).]

Figure  8:  Aggregated  performance  of  different  link  prediction  heuristics. 

We compare the performance of different link prediction heuristics by measuring the area under the precision
curve. Figure 8 aggregates these measures across all datasets within each domain, giving us a sense of their relative
performance. The first thing we note is that there is wide variation in the performance of different link prediction metrics,
with the popular Jaccard similarity performing poorly in many networks. However, some metrics perform consistently better:
RA, AA and hlCS are consistently among the top performing measures. The RA measure is related to the random-walk
based measure CS and the limited-attention epidemic, and both AA and hlCS are variants of this measure. Though we
cannot confirm whether it is the random walk or the limited attention that leads to better performance (since a random
walk is mathematically equivalent to a limited-attention epidemic in an undirected graph), this study indicates that a
practitioner should be careful to choose an appropriate metric for the task, one that reflects the phenomena taking
place in the network.

3.3.1 Relevant Publications.

• Narang, K.; Lerman, K.; and Kumaraguru, P. Network Flows and the Link Prediction Problem. In Proceedings
of KDD workshop on Social Network Analysis (SNA-KDD), 2013.

3.4  Generalized  Laplacian  Framework 

Our major accomplishment was the mathematical framework to model dynamics on networks. This framework unifies
several well known centrality and community measures under a single model. In this dynamics-oriented view, a node’s
centrality describes its participation in the dynamical process taking place on the network. Similarly, communities are
groups of nodes that interact more frequently with each other.

The mathematical framework we developed is general and flexible, able to represent a variety of dynamical
processes. At the core of this framework is the generalized Laplacian matrix:

\[ \mathcal{L} = (T D_W)^{-1/2-\rho} \, (D_W - W) \, (D_W T)^{-1/2+\rho}. \tag{5} \]

Compared with the traditional symmetric normalized Laplacian $\mathcal{L}^{SYM} = I - D^{-1/2} A D^{-1/2}$, the generalized
Laplacian has three additional parameters corresponding to different linear transformations. These transformations
relate the different dynamical processes to the random walk.

1. $\rho$: Similarity transformation. It is an equivalence relation on the space of square matrices, leading to seemingly
unrelated formulations which are in fact the same dynamics in different bases. While $\rho$ technically can be
any real number, in this work we limit ourselves to three special cases: $\rho = 1/2, 0, -1/2$. These cases
correspond to three equivalent formulations we shall call “consensus” ($\mathcal{L}^{CON}$), “symmetric” ($\mathcal{L}^{SYM}$) and “random
walk” ($\mathcal{L}^{RW}$), respectively. While we use $\mathcal{L}^{SYM}$ for mathematical convenience, it is often more intuitive to
think from the random walk or consensus perspective.

2. $T$: Scaling transformation. $T$ is the $n \times n$ diagonal matrix of vertex delay factors. Its $i$th diagonal element $\tau_i$
represents the average delay of vertex $i$. Without loss of generality, we assume that $\tau_i \geq 1$ for all $i \in V$. The scaling
transformation can be understood as rescaling the local delay at each vertex $i$ by $\tau_i$, with the dynamical process’s
waiting times between jumps exponentially distributed with PDF $f(t, \tau_i) = \frac{1}{\tau_i} e^{-t/\tau_i}$.

3. $W$: Reweighing transformation that gives new weights to edges of the network. We use the interaction matrix
$W$ instead of the adjacency matrix $A$. Note that the degree matrix is now also defined in terms of $W$, that is,
$d_{Wi} = \sum_j W_{ij}$. While the scaling transformation changes the delay at each vertex, the reweighing transformation
changes the trajectory of a dynamic process. For example, a biased random walk with transition probability
$P_{ij} \propto \alpha_i A_{ij}$ is equivalent to an unbiased random walk on the reweighed “interaction graph” $W$ with entries
$W_{ij} = \alpha_i A_{ij} \alpha_j$.
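Because $T$ and $D_W$ are diagonal, the generalized Laplacian can be assembled entry by entry. The sketch below is a minimal illustration with a hypothetical function name and a toy path graph; it assumes the operator has the form $(T D_W)^{-1/2-\rho}(D_W - W)(D_W T)^{-1/2+\rho}$ and that no vertex is isolated (a zero degree would make the negative powers undefined).

```python
def generalized_laplacian(W, tau, rho):
    """Entries of (T D_W)^(-1/2-rho) (D_W - W) (D_W T)^(-1/2+rho).

    W:   interaction matrix as a list of rows, with no isolated vertices
    tau: list of vertex delay factors (the diagonal of T)
    rho: similarity-transformation parameter
    """
    n = len(W)
    d = [sum(row) for row in W]                          # d_Wi = sum_j W_ij
    left = [(tau[i] * d[i]) ** (-0.5 - rho) for i in range(n)]
    right = [(d[j] * tau[j]) ** (-0.5 + rho) for j in range(n)]
    return [
        [left[i] * ((d[i] if i == j else 0.0) - W[i][j]) * right[j] for j in range(n)]
        for i in range(n)
    ]


# Toy example: path graph 0 - 1 - 2 with W = A and T = I.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
ones = [1.0, 1.0, 1.0]
L_sym = generalized_laplacian(A, ones, 0.0)     # symmetric formulation
L_rw = generalized_laplacian(A, ones, -0.5)     # random walk formulation
L_con = generalized_laplacian(A, ones, 0.5)     # consensus formulation

# Choosing tau_i = d_max / d_i makes T D_W a multiple of the identity,
# so the result no longer depends on rho.
degrees = [sum(row) for row in A]
tau = [max(degrees) / d for d in degrees]
L_graph = generalized_laplacian(A, tau, 0.0)
```

In this example the columns of `L_rw` sum to zero (probability is conserved by the walk), while the rows of `L_con` sum to zero (a uniform belief vector is stationary).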

The generalized Laplacian framework is flexible enough to capture a variety of well-known processes, such as
random walks and epidemics, and also to describe less-studied processes.

Normalized Laplacian. If the interaction matrix is the adjacency matrix, $W = A$, and the vertex delay factor is the
identity, $T = I$, then with $\rho = 0$ we recover the symmetric normalized Laplacian:

\[ \mathcal{L}^{SYM} = I - D^{-1/2} A D^{-1/2}. \]

Under similarity transformations, the normalized Laplacian is equivalent to some well studied dynamical processes. For
example, by setting $\rho = -1/2$ we have the unbiased random walk

\[ \mathcal{L}^{RW} = I - A D^{-1}, \]

where $A D^{-1}$ forms a stochastic matrix whose $ij$th entry is the transition probability $P_{ij}$. Setting $\rho = 1/2$ we have
the consensus formulation

\[ \mathcal{L}^{CON} = I - D^{-1} A, \]

where each vertex updates its “belief” based on the weighted average “beliefs” of its neighbors.

(Scaled) Graph Laplacian. When $W = A$ and $T = d_{max} D^{-1}$, the generalized Laplacian operator corresponds to the
(scaled) graph Laplacian

\[ \mathcal{L} = \frac{1}{d_{max}} (D - A). \]

Notice that by setting $T = d_{max} D^{-1}$, the diagonal matrix $T D_W$ effectively becomes a scalar. As a result, different
similarity transformations (changing $\rho$) lead to identical linear operators. This operator is often used to describe heat
diffusion processes, and its interpretation is used for distributed calculation of arithmetic means.

Replicator. Let $v$ be the eigenvector of $A$ associated with its largest eigenvalue $\lambda_{max}$: $A v = \lambda_{max} v$. We can then
construct a diagonal matrix $V$ whose elements are the components of the eigenvector $v$. Let us consider the interaction
matrix $W = V A V$ with $T = I$ and $\rho = 0$:

\[ \mathcal{L}^{SYM} = I - D_W^{-1/2} W D_W^{-1/2} = I - \frac{1}{\lambda_{max}} A. \]

This operator is known as the replicator matrix K, and it models epidemic diffusion on a graph. Setting
$\rho = -1/2$ we get the maximum entropy random walk on the original graph $A$:

\[ \mathcal{L}^{RW} = I - W D_W^{-1}. \]

Unbiased Laplacian. Consider the normalized adjacency matrix $W = D^{-1/2} A D^{-1/2}$ as the interaction matrix. With
$T = d_{W max} D_W^{-1}$ we define the unbiased Laplacian matrix:

\[ \mathcal{L} = \frac{1}{d_{W max}} (D_W - W). \]

Just like the graph Laplacian, different values of $\rho$ for the unbiased Laplacian lead to the same operator. Its
interpretation is a degree-based biased random walk with $P_{ij} \propto d_i^{-1/2} A_{ij}$.
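The replicator case rests on the identity that, for $W = V A V$, the reweighted degree of vertex $i$ is $\lambda_{max} v_i^2$, so the symmetric normalization of $W$ collapses to $A / \lambda_{max}$. The sketch below checks this numerically on a small graph. It is only an illustration: the function names are hypothetical, the graph is a toy example, and plain power iteration stands in for a proper eigensolver.

```python
def leading_eigenpair(A, iters=500):
    """Power iteration on A + I; returns (lambda_max, unit eigenvector) of A."""
    n = len(A)
    v = [1.0] * n
    for _ in range(iters):
        # The shift by I keeps the iteration from oscillating on bipartite graphs.
        w = [v[i] + sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * Av[i] for i in range(n)), v


def replicator_from_reweighing(A):
    """Build I - D_W^(-1/2) W D_W^(-1/2) for the interaction matrix W = V A V."""
    n = len(A)
    lam, v = leading_eigenpair(A)
    W = [[v[i] * A[i][j] * v[j] for j in range(n)] for i in range(n)]
    d = [sum(row) for row in W]                  # reweighted degrees d_Wi
    L = [
        [(1.0 if i == j else 0.0) - W[i][j] / (d[i] * d[j]) ** 0.5 for j in range(n)]
        for i in range(n)
    ]
    return L, lam, v, d


# Toy graph: triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
A = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
n = len(A)
L_rep, lam, v, d = replicator_from_reweighing(A)

# Direct form of the same operator: I - A / lambda_max.
direct = [[(1.0 if i == j else 0.0) - A[i][j] / lam for j in range(n)] for i in range(n)]
max_gap = max(abs(L_rep[i][j] - direct[i][j]) for i in range(n) for j in range(n))
```

The same run also exposes the reweighted degrees `d`, which are proportional to the squared eigenvector components.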

These four special cases are related to each other through scaling and reweighing transformations, captured by the
following diagram.

    Normalized Laplacian --(scaling)--> Laplacian
    Normalized Laplacian --(reweighing)--> Replicator
    Normalized Laplacian --(reweighing + scaling)--> Unbiased Laplacian

Our empirical study demonstrates that these special cases, and different transformations in general, can lead to
divergent views about who the central vertices are and what the corresponding communities are (see Figures 9 and 10).
The diagram above helps us better understand how different measures of centrality and communities relate to each other
under the generalized Laplacian framework.

3.4.1  Generalized  Centrality. 

Centrality captures how important a node is in a network. Under the generalized Laplacian framework, different
centrality measures are related to solutions of different dynamical processes.

Similarity transformations (different values of $\rho$) lead to the same state vector $\theta$ at any time $t$ up to a change of
basis. Based on the connection between centrality measures and the stationary distribution of a random walk, we
generalize the definition of centrality to:

\[ c_i = d_{Wi} \tau_i. \tag{6} \]
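Read as the product of a vertex's reweighted degree and its delay factor, generalized centrality is a one-line computation. The sketch below is a hedged illustration on a hypothetical star graph; the function name is an assumption, and the two parameter settings shown are the normalized Laplacian ($W = A$, $T = I$) and the scaled graph Laplacian ($W = A$, $\tau_i = d_{max}/d_i$).

```python
def generalized_centrality(W, tau):
    """c_i = d_Wi * tau_i, where d_Wi = sum_j W_ij is the reweighted degree."""
    return [sum(row) * t for row, t in zip(W, tau)]


# Toy star graph: hub 0 connected to leaves 1, 2 and 3.
A = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
degrees = [sum(row) for row in A]

# Normalized Laplacian setting (T = I): centrality equals degree.
c_normalized = generalized_centrality(A, [1.0] * len(A))

# Scaled graph Laplacian setting (tau_i = d_max / d_i): under this reading of
# the definition every vertex receives the same centrality, d_max.
tau = [max(degrees) / d for d in degrees]
c_graph = generalized_centrality(A, tau)
```

Even on this four-node graph the two operators disagree: the first singles out the hub, while the second treats all vertices alike.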

Generalized centrality reduces to some well known centrality measures for particular settings of the parameters $T$ and $W$.
These measures can now be systematically compared via the scaling and reweighing transformations between special cases
under the generalized Laplacian framework. They include degree centrality $d_i$ for the Normalized Laplacian, and the square of
eigenvector centrality $v_i^2$ for the Replicator. The first column of Figures 9 and 10 illustrates how generalized centrality
differs across the four special cases on the same network.

[Figure 9 panels: (a) centrality profile (generalized centrality); (c) spectral partition using Normalized Laplacian;
(d) spectral partition using Replicator. Legend (dynamics): Normalized Laplacian, Laplacian, Replicator, Unbiased Laplacian.]

Figure 9: Different views of the structure of the House of Representatives network (centrality and communities)
resulting from the dynamics specified by different operators.

[Figure 10 panels: generalized centrality; (c) spectral partition using Normalized Laplacian; (d) spectral partition using
Replicator. Legend (dynamics): Normalized Laplacian, Laplacian, Replicator, Unbiased Laplacian.]

Figure 10: Different views of the structure of the Political Blogs network (centrality and communities) resulting from
the dynamics specified by different operators.

3.4.2 Generalized Conductance and Spectral Clustering.

With the generalized Laplacian framework, we can also define a generalized conductance that measures the quality of a
subset $S$ (of vertices) as a potential community. This measure is used in spectral clustering to find an optimal cut of the
network into subgraphs representing different communities.

\[ h_{\mathcal{L}}(S) = \frac{\mathrm{cut}(S, \bar{S})}{\min\left(\mathrm{vol}_{\mathcal{L}}(S), \mathrm{vol}_{\mathcal{L}}(\bar{S})\right)}
 = \frac{\sum_{i \in S, j \in \bar{S}} W_{ij}}{\min\left(\sum_{i \in S} d_{Wi} \tau_i, \; \sum_{i \in \bar{S}} d_{Wi} \tau_i\right)} \tag{7} \]

Notice that we have generalized the volume measure of a set $S \subseteq V$ to $\mathrm{vol}_{\mathcal{L}}(S) = \sum_{i \in S} d_{Wi} \tau_i$, which is the sum of
the generalized centralities of member vertices.

With generalized conductance we can extend the classic Cheeger’s inequality, which relates the second smallest
eigenvalue of the normalized Laplacian to the conductance of the best bisection in the network, to their generalized
counterparts under our framework. With $\lambda_1$ denoting the second smallest eigenvalue of the generalized Laplacian:

\[ \frac{\left(\min_S h_{\mathcal{L}}(S)\right)^2}{2} \leq \lambda_1 \leq 2 \min_S h_{\mathcal{L}}(S) \tag{8} \]

These theoretical results eventually lead to an efficient spectral clustering algorithm for detecting communities
associated with different dynamics. Our framework also paves the way for efficient local graph partitioning. The last three
columns of Figures 9 and 10 illustrate how the generalized conductance of the good partitions found by our algorithm differs
across the four special cases on the same network.
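A minimal sketch of generalized conductance and of a sweep over candidate communities follows. It is not the project's spectral clustering algorithm: the function names and toy graph are hypothetical, the cut is assumed to be the total interaction weight $W_{ij}$ crossing between $S$ and its complement, and the vertex ordering, which in spectral clustering would come from sorting vertices by an eigenvector of the generalized Laplacian, is simply supplied by hand.

```python
def generalized_conductance(S, W, tau):
    """cut(S, complement) / min(vol(S), vol(complement)), vol = sum of d_Wi * tau_i."""
    n = len(W)
    S = set(S)
    rest = set(range(n)) - S
    cut = sum(W[i][j] for i in S for j in rest)
    c = [sum(W[i]) * tau[i] for i in range(n)]      # generalized centralities
    return cut / min(sum(c[i] for i in S), sum(c[i] for i in rest))


def sweep_cut(order, W, tau):
    """Return the prefix of `order` with the lowest generalized conductance."""
    best_set, best_h = None, float("inf")
    for k in range(1, len(order)):                  # proper, non-empty prefixes
        h = generalized_conductance(order[:k], W, tau)
        if h < best_h:
            best_set, best_h = set(order[:k]), h
    return best_set, best_h


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the bridge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

tau = [1.0] * n                                     # T = I, W = A
best_set, best_h = sweep_cut([0, 1, 2, 3, 4, 5], A, tau)
```

With these settings the sweep separates the two triangles: one edge is cut and each side has volume 7. Passing a different `tau` or interaction matrix changes the volumes and can change which prefix wins, which is the sense in which the partition depends on the dynamics.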

Figures 9 and 10 show the centrality profile and community profile obtained by different operators on two
benchmark networks. The first network, House of Representatives, is the network of congressmen, where a link
represents a co-vote on a bill. The second network, political blogs, shows hyperlinks between blogs. The centrality profile
gives the generalized centrality of each node under a given dynamic operator. The community sweep profile in the second
column gives the generalized conductance of a potential community of size k found by our spectral clustering algorithm.
To improve visualization, both are reordered on the x axis and rescaled on the y axis. The visualizations in the last
two columns are the best bisections found by our algorithm using the indicated special case. As can be seen from
the figures, the centrality profiles under different operators are very different, resulting in alternate views of who the
important nodes are. In addition, the community sweep profiles are also very different, and lead to different partitions of
the same network into communities.

3.4.3  Relevant  Publications. 

• R. Ghosh, K. Lerman, S.-H. Teng, and X. Yan. The interplay between dynamics and networks: Centrality,
communities, and Cheeger inequality. In Proceedings of the 20th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining (KDD’2014), 2014.


4  DELIVERABLES 

Over the course of the project, we delivered code to AFRL. The first deliverable was Netkit, a network analysis toolbox
that includes several centrality computation methods.

The second deliverable was link prediction code that takes a network as input and returns a ranked list of
likely, but not yet observed, links.

5  PUBLICATIONS 

The work conducted over the course of this project resulted in seven journal publications and many more papers in
conference proceedings.

LIST OF SYMBOLS, ABBREVIATIONS AND ACRONYMS

A                        adjacency matrix of a network
D                        diagonal out-degree matrix
$D_W$                    out-degree matrix of the reweighted network
T                        node scaling (time delay) factor matrix
I                        identity matrix
C                        centrality matrix
cr                       centrality vector
$\alpha$                 parameter in centrality calculations setting the length scale of interactions
laAC                     limited-attention Alpha-Centrality
CN                       link prediction heuristic: number of common neighbors
JC                       link prediction heuristic: Jaccard score
AA                       link prediction heuristic: Adamic-Adar score
CS                       link prediction heuristic: conservative score (random walk)
lCS                      link prediction heuristic: limited-bandwidth conservative score
NC                       link prediction heuristic: non-conservative score (epidemic)
lNC                      link prediction heuristic: limited-bandwidth non-conservative
hCS                      link prediction heuristic: hybrid conservative
hlCS                     link prediction heuristic: hybrid limited-bandwidth conservative
hNC                      link prediction heuristic: hybrid non-conservative
hlNC                     link prediction heuristic: hybrid limited-bandwidth non-conservative
$\mathcal{L}$            Laplacian matrix of the network specifying dynamics
$\mathcal{L}^{SYM}$      normalized symmetric Laplacian
$\mathcal{L}^{RW}$       random-walk Laplacian
$\mathcal{L}^{CON}$      consensus Laplacian
$h_{\mathcal{L}}$        conductance of the network with respect to dynamics $\mathcal{L}$
$\mathrm{vol}_{\mathcal{L}}$  volume of the network with respect to dynamics $\mathcal{L}$


Approved  for  Public  Release;  Distribution  Unlimited. 


